relative positional encoding
Structure-informed Positional Encoding for Music Generation
Agarwal, Manvi, Wang, Changhong, Richard, Gaël
Music generated by deep learning methods often suffers from a lack of coherence and long-term organization. Yet, multi-scale hierarchical structure is a distinctive feature of music signals. To leverage this information, we propose a structure-informed positional encoding framework for music generation with Transformers. We design three variants in terms of absolute, relative and non-stationary positional information. We comprehensively test them on two symbolic music generation tasks: next-timestep prediction and accompaniment generation. For comparison, we choose multiple baselines from the literature and demonstrate the merits of our methods using several musically motivated evaluation metrics. In particular, our methods improve the melodic and structural consistency of the generated pieces.
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
HyPE: Attention with Hyperbolic Biases for Relative Positional Encoding
In Transformer-based architectures, the attention mechanism is inherently permutation-invariant with respect to the input sequence's tokens. To impose sequential order, token positions are typically encoded using a scheme with either fixed or learnable parameters. We introduce Hyperbolic Positional Encoding (HyPE), a novel method that uses the properties of hyperbolic functions to encode tokens' relative positions. This approach biases the attention mechanism without needing to store the $O(L^2)$ values of a mask, with $L$ being the length of the input sequence. HyPE relies on preliminary concatenation operations and matrix multiplications, encoding relative distances indirectly by incorporating biases into the softmax computation. This design ensures compatibility with FlashAttention-2 and supports gradient backpropagation for any learnable parameters within the encoding. We analytically demonstrate that, with careful hyperparameter selection, HyPE can approximate the attention bias of ALiBi, thereby offering promising generalization to contexts longer than those encountered during pretraining. The experimental evaluation of HyPE is proposed as a direction for future research.
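A plausible reading of the concatenation trick described above is the hyperbolic addition identity cosh(x - y) = cosh(x)cosh(y) - sinh(x)sinh(y): appending a pair of hyperbolic features to each query and key makes their dot product pick up a term that depends only on the offset $i - j$, so no $L \times L$ bias mask is ever materialized. The NumPy sketch below illustrates that identity only; the function name and the hyperparameters `a` and `c` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def append_hyperbolic_features(q, k, a=0.01, c=-1.0):
    """Concatenate hyperbolic features so that q_i . k_j gains an extra term
    c * cosh(a * (i - j)): a bias depending only on the relative offset,
    via cosh(x - y) = cosh(x)cosh(y) - sinh(x)sinh(y).
    `a` and `c` are illustrative hyperparameters, not taken from the paper."""
    n = q.shape[0]
    pos = a * np.arange(n)
    q_ext = np.stack([c * np.cosh(pos), -c * np.sinh(pos)], axis=1)
    k_ext = np.stack([np.cosh(pos), np.sinh(pos)], axis=1)
    return np.concatenate([q, q_ext], axis=1), np.concatenate([k, k_ext], axis=1)

# Check: augmented logits = plain logits + c * cosh(a * (i - j)).
rng = np.random.default_rng(0)
q, k = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
qh, kh = append_hyperbolic_features(q, k)
i, j = np.arange(8)[:, None], np.arange(8)[None, :]
assert np.allclose(qh @ kh.T, q @ k.T - np.cosh(0.01 * (i - j)))
```

With $c < 0$ the added term penalizes large $|i - j|$, which is how a decaying, ALiBi-like bias can emerge from plain matrix multiplications.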
Modeling Time-Series and Spatial Data for Recommendations and Other Applications
With the research directions described in this thesis, we seek to address the critical challenges in designing recommender systems that can understand the dynamics of continuous-time event sequences (CTES). We follow a ground-up approach: first, we address the problems that may arise from poor-quality CTES data being fed into a recommender system; later, we handle the task of designing accurate recommender systems. To improve the quality of the CTES data, we address the fundamental problem of overcoming missing events in temporal sequences. Moreover, to provide accurate sequence-modeling frameworks, we design solutions for points-of-interest (POI) recommendation, i.e., models that can handle users' spatial mobility data across POI check-ins and recommend candidate locations for the next check-in. Lastly, we highlight that the capabilities of the proposed models have applications beyond recommender systems, and we extend them to large-scale CTES retrieval and human activity prediction. A significant part of this thesis models the underlying distribution of CTES via neural marked temporal point processes (MTPP). Traditional MTPP models are stochastic processes that use a fixed formulation to capture the generative mechanism of a sequence of discrete events localized in continuous time. In contrast, neural MTPP models combine the underlying ideas from the point-process literature with modern deep learning architectures. The ability of deep learning models to act as accurate function approximators has led to a significant gain in the predictive prowess of neural MTPP models. In this thesis, we utilize and present several neural-network-based enhancements to current MTPP frameworks for the aforementioned real-world applications.
- Research Report > Promising Solution (1.00)
- Overview (1.00)
- Research Report > New Finding (0.92)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
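As a concrete instance of the "fixed formulation" that the thesis abstract above attributes to traditional MTPP models, the classical Hawkes process writes the event intensity in closed form; the minimal sketch below is textbook material, not code from the thesis.

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """Classical Hawkes intensity, a fixed parametric MTPP formulation:
    lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i)).
    Every past event excites the process, with influence decaying over time;
    a neural MTPP replaces this closed form with a learned function of history."""
    past = np.asarray(history)
    past = past[past < t]
    return mu + np.sum(alpha * np.exp(-beta * (t - past)))

events = [1.0, 1.1, 1.3]
print(hawkes_intensity(1.4, events))  # elevated: three recent events
print(hawkes_intensity(9.0, events))  # decayed back toward the base rate mu
```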
Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP
Wang, Zhicai, Hao, Yanbin, Gao, Xingyu, Zhang, Hao, Wang, Shuo, Mu, Tingting, He, Xiangnan
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks and have become the main competitor of CNNs and vision Transformers. They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers. However, the heavily parameterized token-mixing layers naturally lack mechanisms to capture local information and multi-granular non-local relations, so their discriminative power is restrained. To tackle this issue, we propose a new positional spatial gating unit (PoSGU). It exploits the attention formulations used in classical relative positional encoding (RPE) to efficiently encode cross-token relations for token mixing, and it reduces the quadratic parameter complexity $O(N^2)$ of vision MLPs to $O(N)$ or even $O(1)$. We experiment with two RPE mechanisms and further propose a group-wise extension that improves their expressive power by providing multi-granular contexts. These then serve as the key building blocks of a new type of vision MLP, referred to as PosMLP. We evaluate the effectiveness of the proposed approach through thorough experiments, demonstrating improved or comparable performance at reduced parameter complexity. For instance, for a model trained on ImageNet1K, we improve accuracy from 72.14% to 74.02% while reducing the number of learnable parameters from 19.4M to 18.2M. Code can be found at https://github.com/Zhicaiwww/PosMLP.
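One way to picture how a relative-position parameterization cuts a token-mixing layer from $O(N^2)$ to $O(N)$ parameters: expand the $N \times N$ mixing matrix from a single learnable vector with one weight per offset, as classical RPE does. The NumPy sketch below shows that expansion only and is my illustration; the actual PoSGU adds the gating and group-wise extensions described in the paper.

```python
import numpy as np

def relative_token_mixing(x, rel_w):
    """Token mixing with O(N) parameters instead of a dense O(N^2) matrix.
    x:     (N, C) token features.
    rel_w: (2N - 1,) learnable weights, one per relative offset; the full
           N x N mixing matrix W[i, j] = rel_w[i - j + N - 1] is expanded
           on the fly and is Toeplitz, as in classical RPE."""
    n = x.shape[0]
    idx = np.arange(n)[:, None] - np.arange(n)[None, :] + n - 1
    return rel_w[idx] @ x  # (N, N) matrix with only 2N - 1 distinct entries

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))           # 16 tokens, 8 channels
rel_w = rng.normal(size=2 * 16 - 1)    # 31 parameters instead of 16**2 = 256
out = relative_token_mixing(x, rel_w)  # (16, 8)
```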
Relative Positional Encoding
In this post, we will take a look at relative positional encoding, as introduced in Shaw et al. (2018) and refined by Huang et al. (2018). This is a topic I meant to explore earlier, but only recently was I able to force myself to dive into it, as I started reading about music generation with NLP language models. That is a separate topic for another post of its own, so let's not get distracted. Let's dive right in! If you're already familiar with transformers, you probably know that they process all input tokens in parallel.
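For readers who want the formula behind the post, the relative-key variant of Shaw et al. (2018) adds a learned embedding per clipped relative offset to each key before the dot product. A minimal NumPy sketch (names and shapes are mine, chosen for illustration):

```python
import numpy as np

def shaw_rpe_logits(q, k, rel_emb, max_dist=4):
    """Attention logits with Shaw et al. (2018)-style relative positions
    (relative-key variant): e[i, j] = q_i . (k_j + a[clip(j - i)]) / sqrt(d),
    where rel_emb holds one learned vector per clipped relative offset."""
    n, d = q.shape
    # Map offsets j - i to indices 0 .. 2 * max_dist after clipping.
    offsets = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None],
                      -max_dist, max_dist) + max_dist
    a = rel_emb[offsets]  # (n, n, d): the relative-position term per (i, j)
    return (q @ k.T + np.einsum('id,ijd->ij', q, a)) / np.sqrt(d)

rng = np.random.default_rng(0)
q, k = rng.normal(size=(10, 16)), rng.normal(size=(10, 16))
rel_emb = rng.normal(size=(2 * 4 + 1, 16))  # 9 embeddings for offsets -4 .. 4
logits = shaw_rpe_logits(q, k, rel_emb)     # (10, 10)
```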
Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding
Luo, Shengjie, Li, Shanda, Cai, Tianle, He, Di, Peng, Dinglan, Zheng, Shuxin, Ke, Guolin, Wang, Liwei, Liu, Tie-Yan
The attention module, a crucial component of the Transformer, cannot scale efficiently to long sequences due to its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original attention, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful attention modules that go beyond the dot-then-exponentiate style, e.g., Transformers with relative positional encoding (RPE). Since relative positional encoding is used by default in many state-of-the-art models, designing efficient Transformers that can incorporate RPE is appealing. In this paper, we propose a novel way to accelerate attention calculation for Transformers with RPE on top of kernelized attention. Based on the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT). With FFT, our method achieves $\mathcal{O}(n\log n)$ time complexity. Interestingly, we further demonstrate that properly using relative positional encoding can mitigate the training instability of vanilla kernelized attention. On a wide range of tasks, we empirically show that our models can be trained from scratch without any optimization issues. The learned model performs better than many efficient Transformer variants and is faster than the standard Transformer in the long-sequence regime.
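The FFT step the abstract relies on is a standard primitive: a Toeplitz matrix built from $2n - 1$ relative-position terms can be embedded in a circulant and applied to the value pathway in $\mathcal{O}(n \log n)$. A self-contained NumPy sketch of that primitive follows; the interface and names are illustrative, not the authors' code.

```python
import numpy as np

def toeplitz_matvec_fft(rel_bias, v):
    """Multiply the Toeplitz matrix T[i, j] = rel_bias[(i - j) + n - 1] by v
    in O(n log n) via a circulant embedding and the FFT.
    rel_bias: (2n - 1,) values for offsets -(n-1) .. n-1.
    v:        (n, d) array (e.g. the value pathway of kernelized attention)."""
    n = v.shape[0]
    # First column of the circulant embedding: offsets 0..n-1, then -(n-1)..-1.
    col = np.concatenate([rel_bias[n - 1:], rel_bias[:n - 1]])
    v_pad = np.concatenate([v, np.zeros((n - 1, v.shape[1]))], axis=0)
    # Circulant matvec = iFFT(FFT(column) * FFT(input)), column-wise over d.
    out = np.fft.ifft(np.fft.fft(col)[:, None] * np.fft.fft(v_pad, axis=0), axis=0)
    return out[:n].real

# Agrees with the dense O(n^2) product.
rng = np.random.default_rng(0)
n, d = 6, 3
rel, v = rng.normal(size=2 * n - 1), rng.normal(size=(n, d))
T = np.array([[rel[i - j + n - 1] for j in range(n)] for i in range(n)])
assert np.allclose(T @ v, toeplitz_matvec_fft(rel, v))
```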
Relative Positional Encoding for Transformers with Linear Complexity
Liutkus, Antoine, Cífka, Ondřej, Wu, Shih-Lun, Şimşekli, Umut, Yang, Yi-Hsuan, Richard, Gaël
Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers; it exploits lags instead of absolute positions for inference. Still, RPE is not available for the recent linear variants of the Transformer, because it requires the explicit computation of the attention matrix, which is precisely what such methods avoid. In this paper, we bridge this gap and present Stochastic Positional Encoding (SPE), a way to generate PE that can be used as a replacement for the classical additive (sinusoidal) PE and provably behaves like RPE. The main theoretical contribution is a connection between positional encoding and the cross-covariance structure of correlated Gaussian processes. We illustrate the performance of our approach on the Long-Range Arena benchmark and on music generation.
- Research Report > New Finding (0.67)
- Research Report > Experimental Study (0.46)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
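The cross-covariance construction in the Liutkus et al. abstract above can be illustrated with random Fourier features: stochastic position codes whose dot products concentrate around a chosen stationary kernel of the lag $i - j$. The sketch below targets a Gaussian kernel and is a generic illustration of the principle under that assumption, not the paper's sinSPE/convSPE implementation.

```python
import numpy as np

def stochastic_position_codes(n, num_feats=256, lengthscale=8.0, rng=None):
    """Random-Fourier-feature position codes phi(i) with
    E[phi(i) . phi(j)] = exp(-(i - j)**2 / (2 * lengthscale**2)),
    i.e. the expected Gram matrix is a stationary kernel of the lag only,
    so the codes behave like an RPE without any explicit attention matrix."""
    rng = rng or np.random.default_rng()
    pos = np.arange(n)
    omega = rng.normal(0.0, 1.0 / lengthscale, size=num_feats)  # spectral draws
    theta = rng.uniform(0.0, 2 * np.pi, size=num_feats)         # random phases
    return np.sqrt(2.0 / num_feats) * np.cos(pos[:, None] * omega + theta)

# With many features the empirical Gram matrix approaches the target kernel.
phi = stochastic_position_codes(32, num_feats=20000, rng=np.random.default_rng(0))
lags = np.arange(32)[:, None] - np.arange(32)[None, :]
target = np.exp(-lags.astype(float) ** 2 / (2 * 8.0 ** 2))
print(np.abs(phi @ phi.T - target).max())  # small, shrinking as num_feats grows
```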